ABSTRACT

This paper explores the error-robustness of phone-to-word transduction
across a variety of languages. We implement a noisy channel model in
which a phonetic input stream is corrupted by an error model, and then
transduced back to words using the inverse error model and linguistic
constraints. By controlling the error level, we are able to measure the
sensitivity of different languages to degradation in the phonetic input
stream. This analysis is carried further to measure the importance of
each phone in each language individually. We study Arabic, Chinese,
English, German and Spanish, and find that they behave similarly in this
paradigm: in each case, a phone error produces about 1.4 word errors,
and frequently incorrect phones matter slightly less than others. In the
absence of phone errors, transduced word errors are still present, and
we use the conditional entropy of words given phones to explain the
observed behavior.

Index  Terms —  Speech  recognition,  phonetic  decoding, 
transduction,  multilingual,  ASR 

1.  INTRODUCTION 

State-of-the-art speech recognition systems currently apply all the
information sources at their disposal simultaneously in the decoding
process. These sources consist of the pronunciation dictionary, the
context model or decision tree, the language model, and the actual
acoustic model or Gaussians. This consolidation is most complete in
decoders based on the Finite State Transducer paradigm [1, 2], where the
dictionary, language model, and decision tree can be fully combined in
advance of any decoding, but it is present in other decoding
architectures as well, for example in the form of language model
lookahead [3]. While this strategy is highly effective, from the
research point of view it may be easier to implement and test new
modeling techniques in a more decoupled framework.

Therefore, there has been a significant amount of work in recent years
to support modularized recognizers for research purposes. In the FLaVoR
architecture developed at Leuven University [4, 5], decoding is broken
into a two-step process, the first generating phone lattices and the
second applying morpho-phonological and morpho-syntactic constraints to
produce words. Similarly, in the Automatic Speech Attribute
Transcription paradigm [6], it is proposed that the recognition process
should proceed bottom up through multiple stages.

In an effort to better understand the properties of a modularized
system, this paper studies the intrinsic difficulty of converting from
phones to words. The first stage uses the phone set of [7] and
associated acoustic models to recover a one-best phone sequence. The
second stage uses a finite state transducer scheme to recover words from
phones. In contrast with previous work on multi-stage decoding, our work
relies solely on an error model in the transduction phase to formally
model the mistakes that are made at the phone recognition level. The
error model is an unconstrained model of IID insertions, substitutions
and deletions, and is more general than the single error model of [5].
The advantage of using the error model approach is that it allows us to
directly implement a noisy channel model of speech communication, and to
pose and answer a number of interesting questions. Specifically, we
conduct a class of experiments that involves corrupting a reference
phone sequence with a known error model, and then measuring our ability
to recover words. This allows us to answer several questions that have
not been well studied before:

1. How easy is it to recover words from a correct but unsegmented
   phone string, and how does this vary across languages?

2. As the phonetic input stream is corrupted with errors, how quickly
   is our ability to recover words degraded? Are there threshold
   effects where a small number of phonetic errors can always be
   detected and recovered from? How does this vary across languages?

3. Are errors in some phones more important than errors in others, and
   how does this vary across languages?

4. How do the computational requirements of the phone-to-word
   transduction process vary as the phonetic input is progressively
   degraded?

The  remainder  of  this  paper  is  organized  as  follows:  in 
Section  2  we  present  the  formulation  of  our  method.  Section  3 
describes  the  CallHome  dataset,  and  the  phone  recognizer  that 
was  used  for  the  different  languages.  Section  4  examines  the 
robustness  of  the  transduction  process  to  phonetic  errors,  and 
presents  an  explanation  for  the  observed  behavior.  Section 
5  addresses  the  question  of  whether  some  phones  are  more 
important  than  others.  Section  6  offers  concluding  remarks. 
2.  FORMULATION

In the noisy-channel model we adopt, we assume that the sender begins
with a sequence of words he or she intends to communicate, and speaks a
phonetic sequence determined by the pronunciations of those words. A
phone recognizer then processes the audio and produces an errorful
version of the intended phones. The receiver gets this corrupted phone
sequence and must decode the likeliest sequence of intended words. This
can be more precisely stated if we let $w$ denote the intended words,
$p_i$ denote the intended phone sequence, and $p_c$ denote the corrupted
phone sequence. The job of the decoder is then to determine

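\[
\begin{aligned}
\arg\max_{w} P(w \mid p_c) &= \arg\max_{w} P(w)\,P(p_c \mid w) \\
&= \arg\max_{w} P(w) \sum_{p_i} P(p_i, p_c \mid w) \\
&= \arg\max_{w} P(w) \sum_{p_i} P(p_i \mid w)\,P(p_c \mid p_i, w) \\
&\approx \arg\max_{w,\,p_i} P(w)\,P(p_i \mid w)\,P(p_c \mid p_i)
\end{aligned}
\]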

The factors involved in the maximization each have simple
interpretations: $P(w)$ is given by the language model; $P(p_i \mid w)$
is given by the pronunciation model; and $P(p_c \mid p_i)$ is given by
the phone-level error model. In all the experiments described
subsequently, we use a first-order error model with insertion and
deletion probabilities for every phone, and substitution probabilities
for all pairs of phones. Table 1 illustrates an example of our noisy
channel model.

There is a simple representation of this model in terms of finite state
transducer operations. Denote the intended word sequence by $W$, the
pronunciation dictionary by $P$, the language model by $L$, the error
model by $E$, the process of sampling a random path through a finite
state acceptor by sample, and the process of finding the likeliest path
by bestpath. Then the received (corrupted) phone sequence $R$ is given
by $R = \mathrm{sample}(W \circ P \circ E)$. The operation of decoding
can be represented as
$\mathrm{bestpath}(R \circ E^{-1} \circ P^{-1} \circ L)$.
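The sampling step can be illustrated without a transducer toolkit. The following is a minimal sketch, assuming one particular parameterization of the first-order IID error model (at most one insertion before each intended phone, then either a deletion or a substitution). The function name `corrupt` and the dictionary layout are hypothetical and are not taken from the system described here.

```python
import random

def corrupt(phones, sub, delete, insert, rng):
    """Sample a corrupted phone sequence from an IID error model.

    Assumed (hypothetical) parameterization:
      sub[p]    : dict q -> P(emit q | intended p, not deleted), identity included
      delete[p] : probability that intended phone p is dropped
      insert[q] : probability that q is inserted before an intended phone
                  (the remaining mass means "no insertion")
    """
    out = []
    for p in phones:
        # Optionally insert one spurious phone before the intended phone.
        u = rng.random()
        acc = 0.0
        for q, prob in insert.items():
            acc += prob
            if u < acc:
                out.append(q)
                break
        # Either delete the intended phone or emit a (possibly identical) phone.
        if rng.random() < delete.get(p, 0.0):
            continue
        choices = list(sub[p].keys())
        weights = list(sub[p].values())
        out.append(rng.choices(choices, weights=weights, k=1)[0])
    return out
```

With an identity model (every phone maps to itself with probability one, no insertions or deletions) the function returns its input unchanged, which corresponds to the error-free condition studied later.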

Given this formulation, it is possible to explore the questions raised
in Section 1. To find the intrinsic difficulty of recovering words from
phones in the error-free case, we implement the noisy channel model with
an "identity" error model that never inserts or deletes, and always
replaces a phone by itself. To study the sensitivity of the decoding
process to phone errors, we construct error models with various error
rates, and then compute
$\mathrm{bestpath}(\mathrm{sample}(W \circ P \circ E) \circ (E^{-1} \circ P^{-1} \circ L))$.
Finally, it is possible to explore the importance of single phones. Let
$E_p$ be the original error model $E$, except that errors involving
phone $p$ are adjusted to have zero probability. Then measuring the
difference between using $E$ and $E_p$ in the round-trip process gives
an indication of the importance of $p$. We have explored the use of this
methodology in five of the CallHome languages, using an acoustic model
that uses a universal phone set. The database and acoustic model are
described next.
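Constructing a per-phone error model of this kind can be sketched as follows. The text specifies only that errors involving the phone receive zero probability. Where the freed probability mass goes is an assumption of this sketch (it is moved to the identity substitution), and the function name and dictionary layout are hypothetical.

```python
def exclude_phone(sub, delete, insert, p):
    """Return a copy of the error model in which no error involves phone p.

    sub[src][q] is P(emit q | intended src, not deleted); delete[src] and
    insert[q] are deletion and insertion probabilities. Freed mass is given
    to the identity substitution (an assumption of this sketch).
    """
    new_sub = {}
    for src, row in sub.items():
        row = dict(row)
        if src == p:
            # The excluded phone is always emitted as itself.
            row = {p: 1.0}
        elif p in row:
            # Other phones can no longer be misrecognized as p.
            row[src] = row.get(src, 0.0) + row.pop(p)
        new_sub[src] = row
    new_delete = dict(delete)
    new_delete[p] = 0.0  # p is never deleted
    new_insert = {q: pr for q, pr in insert.items() if q != p}  # p is never inserted
    return new_sub, new_delete, new_insert
```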


Intended words      I’m sorry we’ll blame him
Intended phones     aI m S a r i: w i: l b l ei m H I m
Corrupted phones    aI m S a r i: w i: D l ei m H I m
Recovered words     I’m sorry we blame him

Table 1. Steps in the noisy channel model


3.  DATABASE  AND  ACOUSTIC  MODELS 

3.1.  CallHome 

In order to work with a data set with roughly equal resources across a
variety of languages, we used the CallHome database [8]. This database
has speech, transcriptions, and lexica in Egyptian Arabic, Mandarin
Chinese, English, German, Japanese, and Spanish. The audio data for each
language consists of 120 telephone conversations of up to 30 minutes
each (100 conversations for German). Eighty of the conversations are
marked as training data and 20 each for development and test, except for
German, which has development data only. Since the experiments did not
involve parameter tuning, and a test set is absent for German, all
results are reported on the development set. Due to a high
out-of-vocabulary rate for the Japanese lexicon, we did not use the
Japanese language data.

3.2.  The  UPR

To conduct our experiments, we need a phone-level error model for each
language, reflecting realistic error patterns. To obtain these error
models, we decoded the training data with acoustic models based on a
universal phone recognizer (UPR) provided by the Department of Defense
[7]. This recognizer uses 259 phones based on the International Phonetic
Alphabet (IPA), and represents an effort similar to that pioneered with
the GlobalPhone project and others [9, 10].

The UPR system was built using the HTK Recognizer, version 3.3, and was
trained iteratively, starting with data that was transcribed at the
phone level, and later incorporating data that was transcribed at the
word level. In the first stage of training the UPR, phonetically
transcribed data was taken from the Phonetic Switchboard Corpus [11, 12]
in English, and the OGI-MLTS Corpus [13] in English, German, Hindi,
Japanese, Mandarin Chinese, and Spanish. Word-level transcriptions were
later used to incorporate data from LDC data sets (e.g. CallHome and
CallFriend) in a variety of languages. The total amount of acoustic
training data used in the five languages studied here varied from about
15 hours in German to 88 hours in English. The overall training process
was designed to ensure that sounds represented by a given phone are
consistent across languages and that important phonemic distinctions in
one language are annotated in all languages.

The UPR acoustic models have diphone acoustic context, with 17 Gaussians
per state. The acoustic features were 39-dimensional, consisting of
cepstra, deltas and double-deltas, and decoding was performed at the
speaker-independent level.
Fig. 1. Output word error rate vs. input phone error rate (horizontal axis: phone error rate, %)

The UPR makes use of n-gram phonotactic language models trained on
transcripts of LDC data as well as data found on the Web.
Language-specific phonotactic bigram language models were built for all
the languages used in our experiments. Further details of the UPR phone
set, acoustic model, and phonotactic language models can be found in
[7].

The  UPR  can  be  run  using  either  a  truly  universal  model 
or  using  language-specific  models.  We  used  language-specific 
models  to  decode  the  CallHome  training  data  and  create  the 
error  models.  The  phone-error  rates  on  the  test  data  varied 
from  56.4%  in  English  to  63.0%  in  German. 

4.  ROBUSTNESS  TO  PHONETIC  ERRORS 

This section reports on the sensitivity of the transduction process to
the overall error level in the input phone stream. The experiments all
use a base error model that is obtained by decoding the CallHome
training data with the UPR, aligning it to the reference phoneme
strings, and computing the various substitution, insertion and deletion
probabilities. This is done separately for each language. To obtain
error models at a variety of absolute error levels, we then scale this
matrix down by moving probability mass from insertions, deletions and
non-identity substitutions to identity substitutions. By corrupting the
reference phones with the various error matrices and then measuring our
ability to recover the correct words, we determine the sensitivity of
the decoding process to input errors.
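The scaling step can be sketched as below. This is an illustration under assumptions: the error model is stored as dictionaries rather than a single matrix, substitution rows are conditional on the phone not being deleted and sum to one, and a single multiplicative `factor` is applied to every error probability. None of these details are specified in the text, and the function name is hypothetical.

```python
def scale_errors(sub, delete, insert, factor):
    """Scale every error probability by `factor` (0 <= factor <= 1).

    Mass removed from non-identity substitutions is moved to the identity
    substitution; deletion and insertion probabilities shrink directly.
    factor = 0 yields the identity error model, factor = 1 the base model.
    """
    new_sub = {}
    for p, row in sub.items():
        scaled = {q: pr * factor for q, pr in row.items() if q != p}
        scaled[p] = 1.0 - sum(scaled.values())  # identity absorbs the freed mass
        new_sub[p] = scaled
    new_delete = {p: pr * factor for p, pr in delete.items()}
    new_insert = {q: pr * factor for q, pr in insert.items()}
    return new_sub, new_delete, new_insert
```

Sweeping `factor` from 0 to 1 then gives a family of error models whose relative error pattern matches the empirically derived one while the absolute error level varies.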

4.1.  Accuracy  and  Speed 

Figure 1 plots transduced word error rate (WER) as a function of the
input phone error rate (PER). To a first approximation, the two are
related by $\mathrm{WER} = 1.4\,\mathrm{PER} + \epsilon_{lang}$. The
slope in all cases is approximately 1.4, and there is a
language-dependent y-intercept $\epsilon_{lang}$. These results show no
evidence of redundancy: if redundancy were present, one would expect a
threshold effect in which very low phone error rates would have little
or no impact on word error rate. In terms of runtime, we have found that
whereas the word error rate scales linearly with phone error rate, the
runtime increases exponentially, from less than one two-thousandth of
real time in the absence of errors to one-tenth of real time at about
50% phone error rate. Again, this is similar across the languages
studied. All experiments were run with a fixed beam such that there was
little accuracy loss at high phone error rates.

             Phone-to-word WER (%)    Entropy (bits)
Egyptian     0.6                      0.0020
German       0.9                      0.029
English      2.3                      0.080
Spanish      5.0                      0.18
Mandarin     8.9 (CER)                0.44

Table 2. Conditional entropy of words given phones
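As a worked illustration of the approximately linear relation with slope 1.4, take the English error-free word error rate of 2.3% from Table 2 as the intercept. The input phone error rate of 10% is chosen here purely for illustration, and the result is what the linear approximation predicts rather than a measured value:

\[
\mathrm{WER} \approx 1.4 \times 10\% + 2.3\% = 16.3\%
\]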


4.2.  Conditional  Entropy:  Explaining  the  y-intercept 

The transduced word error rate achieved in the absence of any phone
errors is not zero, and differs by over a factor of ten, from 0.6% to
8.8%, across the different languages. To understand the observed
differences in the y-intercept, we examine the conditional entropy of
words given phones, which can be computed as the entropy of the words
less the mutual information between phones and words. To define the
mutual information between phones and words, let $r_s$ be the phone
sequence for utterance $s$ in the database, and let $l_s$ be the
corresponding word sequence. Sums over $s$ are thus over the observed
data segments. Let $R$ and $L$ be phone-sequence and word-sequence
variables, respectively, that take specific values such as $r_s$ and
$l_s$. Then


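\[
\begin{aligned}
M(L;R) &= \sum_{L,R} P(L,R)\,\log\frac{P(L,R)}{P(L)\,P(R)} \\
&\approx \sum_{s} \log\frac{P(r_s, l_s)}{P(r_s)\,P(l_s)} \\
&= \sum_{s} \log\frac{P(r_s \mid l_s)}{\sum_{w} P(r_s \mid w)\,P(w)}
\end{aligned}
\]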

$P(w)$ is given by the language model. $P(r_s \mid w)$ is the
probability of an observed phone string given a word string. It is given
by the sum over all the alignments of $r_s$ to the phones in $w$ of the
probability of the substitutions, insertions and deletions in the
alignment, and can be computed using dynamic programming.
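The dynamic program can be sketched as a forward recursion. The sketch below assumes one specific generative model: before each intended phone at most one phone may be inserted, after which the intended phone is either deleted or emitted through the substitution table. This model and the names used are assumptions made for illustration, not the exact formulation used in the experiments.

```python
def alignment_probability(intended, observed, sub, delete, insert):
    """P(observed | intended), summed over all alignments.

    sub[p][q] = P(emit q | intended p, not deleted); delete[p] = P(p dropped);
    insert[q] = P(q inserted before an intended phone). f[i][j] is the
    probability that the first i intended phones produced the first j
    observed phones.
    """
    n, m = len(intended), len(observed)
    no_ins = 1.0 - sum(insert.values())
    f = [[0.0] * (m + 1) for _ in range(n + 1)]
    f[0][0] = 1.0
    for i in range(1, n + 1):
        p = intended[i - 1]
        d = delete.get(p, 0.0)
        row = sub.get(p, {})
        for j in range(m + 1):
            # No insertion, intended phone deleted: consumes no observed phone.
            total = f[i - 1][j] * no_ins * d
            if j >= 1:
                q = observed[j - 1]
                # No insertion, intended phone emitted as q.
                total += f[i - 1][j - 1] * no_ins * (1.0 - d) * row.get(q, 0.0)
                # q inserted, intended phone deleted.
                total += f[i - 1][j - 1] * insert.get(q, 0.0) * d
            if j >= 2:
                q_ins, q = observed[j - 2], observed[j - 1]
                # q_ins inserted, then intended phone emitted as q.
                total += (f[i - 1][j - 2] * insert.get(q_ins, 0.0)
                          * (1.0 - d) * row.get(q, 0.0))
            f[i][j] = total
    return f[n][m]
```

The table has (n+1)(m+1) cells with a constant amount of work per cell, so the cost is proportional to the product of the two sequence lengths.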

The quantity $M(L;R)$ is a measure of how much information the phones
provide about the words. If we let $H(L)$ be the entropy of the
language, then $M(L;R) - H(L)$ provides a measure of the excess
information that is available when inferring words from phones, and its
negative is in fact the entropy of the language conditioned on knowledge
of the phone strings. In general, $M(L;R)$ is difficult to compute since it
involves  summing  over  all  possible  word  sequences  in  the  de- 
Fig.  2.  Sensitivity  to  individual  phones


6.  DISCUSSION 

This paper has examined the robustness of phone-to-word transduction in
a variety of languages and over a range of error rates. We find that the
introduction of a phone error on average creates about 1.4 word errors,
and this is seen to be constant across the five languages studied, and
across a wide range of absolute error levels. At the level of individual
phones, the sensitivity to errors is almost linear as well, but seems to
be optimized in the sense that frequently misleading phones have
slightly less impact per error than their more reliable counterparts.
nominator. To simplify the computation, we have approximated the sum
over all data segments by a sum over the words in the lexicon weighted
by their unigram frequency. Essentially this uses a notional data set
consisting of the words in the lexicon. Table 2 shows the conditional
entropy along with the round-trip word error rates. It can be seen that
there is a good correlation between this entropy and the observed word
error rate.
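A simplified version of the lexicon-based approximation can be sketched as follows. It assumes the error-free (identity) case and a single pronunciation per word, so that the only source of uncertainty is homophony: the posterior of a word given its phone string is its unigram probability divided by the total probability of all words sharing that pronunciation. These simplifications, and the names used, are assumptions of the sketch.

```python
import math
from collections import defaultdict

def conditional_entropy_bits(unigram, pronunciation):
    """H(word | phones) in bits over a unigram-weighted lexicon.

    unigram[word] is P(word); pronunciation[word] is a tuple of phones.
    Assumes one pronunciation per word and no phone errors.
    """
    # Total probability mass of each distinct phone string.
    mass = defaultdict(float)
    for word, prob in unigram.items():
        mass[pronunciation[word]] += prob
    h = 0.0
    for word, prob in unigram.items():
        if prob > 0.0:
            posterior = prob / mass[pronunciation[word]]  # P(word | phones)
            h -= prob * math.log2(posterior)
    return h
```

For a toy lexicon in which "two" and "too" share a pronunciation and each has probability 0.25, while "cat" has probability 0.5, the function returns 0.5 bits. A lexicon without homophones gives zero.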

5.  SENSITIVITY  TO  INDIVIDUAL  PHONES 

By using our noisy channel model, we have been able to study the
sensitivity of word error rate to individual phones in two ways. The
first uses the corruption process described in Section 2. Insertions,
deletions and substitutions are made according to the empirically
derived error model, with one exception: all errors involving a
particular phone are excluded. The corruption process is run separately
for each phone, and the resulting strings are transduced to words. The
transduced word error rate is then computed, and we compute the decrease
in error rate over the baseline where no errors are excluded. To
normalize against frequency effects, we also count the number of phone
errors that have been excluded from the input. This allows us to create
a scatterplot of the number of word errors corrected after transduction
against the number of phone errors corrected on the input side. This is
shown in Figure 2 for each phone in each of the languages studied.
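Word error rate is conventionally computed from the minimum edit distance between the reference and hypothesized word strings. The sketch below uses standard Levenshtein alignment with unit costs and whitespace tokenization. It is an assumption that this matches the scoring used here, and the function name is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + insertions + deletions) / number of reference words.

    Assumes a non-empty reference and whitespace-separated words.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion of r
                           cur[j - 1] + 1,           # insertion of h
                           prev[j - 1] + (r != h)))  # match or substitution
        prev = cur
    return prev[-1] / len(ref)
```

Applied to the example of Table 1, the reference "I'm sorry we'll blame him" and the recovered string "I'm sorry we blame him" differ by one substitution in five words, giving a word error rate of 0.2.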

The second method of computing sensitivity to individual phones avoids
the artificial corruption process. This is done by aligning the
phone-level UPR output to the reference phone string. Then, for a
particular phone, we fix all the errors involving the phone. The
remaining steps are identical to the first method, and we obtain another
scatterplot. This plot is similar to that of Figure 2, with somewhat
greater dispersion. The fact that Figure 2 is on a log-log scale with a
slope of about 0.9 indicates that there is a slight tendency for phones
which are frequently involved in errors to be relatively downweighted.
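The link between a log-log slope below one and downweighting can be made explicit. Writing $x$ for the number of phone errors corrected and $y$ for the number of word errors corrected, and letting $c$ denote the intercept of the fitted line (a symbol introduced here for illustration), a slope of 0.9 corresponds to

\[
\log y = 0.9 \log x + c
\quad\Longrightarrow\quad
y = e^{c}\,x^{0.9}
\quad\Longrightarrow\quad
\frac{y}{x} = e^{c}\,x^{-0.1}.
\]

The number of word errors corrected per phone error corrected therefore decreases slowly as $x$ grows: under this fit, a phone involved in ten times as many errors accounts for about $10^{0.9} \approx 7.9$ times as many word errors, so its impact per error is about $10^{-0.1} \approx 0.79$ of that of the less error-prone phone.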